
test/integration: goroutine leak check #115456

Merged
merged 1 commit into kubernetes:master from the goroutine-leak-check branch on Feb 14, 2023

Conversation

@pohly (Contributor) commented Feb 1, 2023

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

The current approach of ignoring a certain number of additional goroutines is too vague. During testing, we want to ensure that no additional goroutines at all get left behind by a test.

Some goroutines that get started during init are okay, and uber-go/goleak can handle that, in contrast to the old in-tree code. For that reason, and because goleak has nicer output, it replaces the in-tree code.

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/dependency Issues or PRs related to dependency changes area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 1, 2023
@@ -228,3 +233,10 @@ func StartTestServer(t testing.TB, setup TestServerSetup) (client.Interface, *re

return kubeAPIServerClient, kubeAPIServerClientConfig, tearDownFn
}

type checkNothing struct{}
pohly (Contributor Author):

The alternative is (untested):

func init() {
    _ = healthz.LogHealthz.Check(nil)
}

That starts the goroutine during init and then goleak.IgnoreCurrent knows about it.

I'm undecided whether that is a better solution than replacing the instance.

Member:

can't we plumb a context on the handler to stop it on apiserver shutdown?

Member:

it seems it doesn't work

	found unexpected goroutines:
	[Goroutine 2263 in state select, with k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.BackoffUntil on top of the stack:
	goroutine 2263 [select]:
	k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0x59f07c8?, {0x59be860, 0xc0092efbf0}, 0x1, 0xc0000925a0)
		/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:169 +0x135
	k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x59c0fa0?, 0xdf8475800, 0x0, 0x25?, 0xc00317ef00?)
		/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:135 +0x89
	k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.Until(...)
		/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:92
	k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.Forever(0x0?, 0x0?)
		/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:83 +0x28
	created by k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/healthz.(*log).Check.func1
		/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/healthz/healthz.go:74 +0xe5
	]

pohly (Contributor Author):

can't we plumb a context on the handler to stop it on apiserver shutdown?

No, healthz.go initializes the variable during init. The goroutine only gets started when the checker is used for the first time.

it seems it doesn't work

I'll check that tomorrow. So far I have only been testing with test/integration/scheduler_perf; make test-integration fails for me for various reasons.

pohly (Contributor Author):

We have multiple copies of StartTestServer. I modified the one in test/integration/framework/test_server.go, but not the one in cmd/kube-apiserver/app/testing/... let's do this all in etcd.go where the leak checking is done.

after := runtime.NumGoroutine()
stacktraces := make([]byte, 1<<20)
runtime.Stack(stacktraces, true)
klog.Fatalf("unexpected number of goroutines: before: %d after %d\n%sd", before, after, string(stacktraces))
pohly (Contributor Author):

Note that replacing this code also fixes a slight bug: the string that gets written here is 1<<20 bytes long.

The code should have used:

len := runtime.Stack(stacktraces, true)
... string(stacktraces[:len]) ...
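
For illustration, a self-contained version of that fix (the helper name is made up, not code from this PR):

import "runtime"

// dumpGoroutines returns the stacks of all current goroutines, truncated to
// the bytes that runtime.Stack actually wrote instead of the whole buffer.
func dumpGoroutines() string {
    buf := make([]byte, 1<<20)
    n := runtime.Stack(buf, true) // n <= len(buf)
    return string(buf[:n])
}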

Member:

you are awesome 👏

@alculquicondor (Member):

this is more on sig-testing side of things
/assign @aojea

@pohly (Contributor Author) commented Feb 1, 2023

this is more on sig-testing side of things

Yes, and the change to scheduler_perf_test.go was by accident (too much experimentation and too many rebases...). I've taken that out.

go.mod Outdated
@@ -74,6 +74,7 @@ require (
go.opentelemetry.io/otel/sdk v1.10.0
go.opentelemetry.io/otel/trace v1.10.0
go.opentelemetry.io/proto/otlp v0.19.0
go.uber.org/goleak v1.2.0
Member:

I see this is much better, but the current method is simple and "good enough" for our purposes. Do we want to pull in a dependency just for this? I defer to the deps people to judge this.

@liggitt @dims

Member:

I wouldn't want to add a dependency for this, especially not one that has any transitive dependencies

@dims (Member) commented Feb 1, 2023:

We do have some code we can reorganize into a package if needed:

before := runtime.NumGoroutine()
stop, err := startEtcd()
if err != nil {
    klog.Fatalf("cannot run integration tests: unable to start etcd: %v", err)
}
result := tests()
stop() // Don't defer this. See os.Exit documentation.
checkNumberOfGoroutines := func() (bool, error) {
    // We leave some room for leaked goroutines as there are
    // still some leaks, mostly:
    // - leak from lumberjack package we're vendoring
    // - leak from apiserve healthz
    // - leak from opencensus library
    // Once fixed, we should be able to bring it down to zero.
    if dg := runtime.NumGoroutine() - before; dg <= 3 {
        return true, nil
    }
    // Allow goroutines to schedule and die off.
    runtime.Gosched()
    return false, nil
}
// It generally takes visibly less than 1s to finish all goroutines.
// But we keep the limit higher to account for cpu-starved environments.
if err := wait.Poll(100*time.Millisecond, 5*time.Second, checkNumberOfGoroutines); err != nil {
    after := runtime.NumGoroutine()
    stacktraces := make([]byte, 1<<20)
    runtime.Stack(stacktraces, true)
    klog.Fatalf("unexpected number of goroutines: before: %d after %d\n%sd", before, after, string(stacktraces))
}

@pohly (Contributor Author) commented Feb 1, 2023:

current method is simple and "good enough" for our purposes,

It's not 😞

The current code doesn't support ignoring the goroutines that were started during init. That's crucial for reliably ignoring some of the goroutines that currently force us to do the fuzzy "3 goroutine leaks are okay" thing.

We do have some code we can reorganize into a package if needed:

I had started improving that code, but then realized that I would end up re-implementing goleak.

Is the problem with goleak the dependencies of the package? The actual code that we pull is pretty simple.

What about simply copying that code into test/utils/goleak?

pohly (Contributor Author):

check-dependency-stats doesn't look too bad to me:

Switched to a new branch 'base'
--- /logs/artifacts/stats-base.txt	2023-02-01 16:00:51.332458491 +0000
+++ /logs/artifacts/stats.txt	2023-02-01 16:00:50.944422049 +0000
@@ -1,6 +1,6 @@
 Direct Dependencies: 210 
-Transitive Dependencies: 235 
-Total Dependencies: 285 
+Transitive Dependencies: 236 
+Total Dependencies: 286 
 Max Depth Of Dependencies: 23 
 All dependencies:
 
@@ -249,6 +249,7 @@
 go.uber.org/multierr
 go.uber.org/zap
 golang.org/x/crypto
+golang.org/x/lint
 golang.org/x/mod
 golang.org/x/net
 golang.org/x/oauth2


pohly (Contributor Author):

The integration tests are now passing with the vague "allow 3 additional goroutines" check replaced with more precise suppressions through goleak.Ignore*. I found that in two cases I needed goleak.IgnoreTopFunction. I'd prefer just using goleak as-is instead of investing further effort into improving our own code or forking goleak (regardless of whether that would live under "third_party/forked" or "test").

@liggitt: WDYT, is golang.org/x/lint as new additional dependency besides goleak itself acceptable?

Or is this about other dependencies that don't show up in go.mod and check-dependency-stats?

@liggitt (Member) commented Feb 2, 2023:

@liggitt: WDYT, is golang.org/x/lint as new additional dependency besides goleak itself acceptable?

  1. golang.org/x/lint is marked deprecated and frozen... I don't have confidence it will be updated if required to continue working with new go versions, so I really don't want it in our graph (I opened Add golang.org/x/lint to unwanted dependencies #115488 to record that, since we're not currently depending on it). It looks like the leak library only depends on it because they explicitly take an _ import on it and run it in their make lint target. Perhaps an upstream issue/contribution to drop that and scope it down to go vet would be accepted?

Or is this about other dependencies that don't show up in go.mod and check-dependency-stats?

  1. check-dependency-stats isn't surfacing it, but even though go.uber.org/goleak was in our graph before, it was unused... now it is actively used and vendored, so any changes / additions to its dependency graph impact us more directly now. It looks like the only dependency it currently has (aside from lint) is testify, which we also have, so there's not a current transitive dependency that gets pulled in that's problematic... that's more of an unknown/future growth concern.

  2. The last thought I have about this is that goroutine leak detection seems likely to need tweaking around Go releases... depending on a library to do this seems like it will be slower to react if changes are needed for a new Go version.

Member:

I opened uber-go/goleak#92 with a request / suggestion for dropping the golang.org/x/lint reference from the graph

pohly (Contributor Author):

I submitted a PR for goleak. Let's wait for that.

@pohly pohly changed the title test/integration: goroutine leak fixes + check WIP: test/integration: goroutine leak fixes + check Feb 1, 2023
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 1, 2023
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Feb 2, 2023
@pohly pohly force-pushed the goroutine-leak-check branch 2 times, most recently from 5ea0042 to 7cff84f on February 2, 2023 18:32
// There's no good way to ignore this particular loop.
// Matching against Sleep is only indirect and not very
// precise.
goleak.IgnoreTopFunction("time.Sleep"),
pohly (Contributor Author):

I'm not happy with this particular Ignore. However, adding some kind of forced shutdown to

for {
    if recordEvent(sink, result.Event, result.Patch, result.Event.Count > 1, eventCorrelator) {
        break
    }
    tries++
    if tries >= maxTriesPerEvent {
        klog.Errorf("Unable to write event '%#v' (retry limit exceeded!)", event)
        break
    }
    // Randomize the first sleep so that various clients won't all be
    // synced up if the master goes down.
    if tries == 1 {
        time.Sleep(time.Duration(float64(sleepDuration) * rand.Float64()))
    } else {
        time.Sleep(sleepDuration)
    }
}
also doesn't seem very appealing.

I'm open to suggestions...

Member:

add a context to

 recordToSink(ctx context.Context, sink EventSink, event *v1.Event, eventCorrelator *EventCorrelator, sleepDuration time.Duration)

and cancel the context on Shutdown

func (e *eventBroadcasterImpl) Shutdown() {
    e.Broadcaster.Shutdown()
}

pohly (Contributor Author):

I am not sure whether that is possible. There are comments elsewhere (don't remember exactly where) that say that pending events will still get delivered after a shutdown, so the current behavior might be intentional. Changing it might break something?

pohly (Contributor Author):

Besides, adding a context to recordToSink then implies changing the StartEventWatcher API - let's not.

The problem is that the sink blocks when it should simply give up:

rsc.eventBroadcaster.StartRecordingToSink(&v1core.EventSinkImpl{Interface: rsc.kubeClient.CoreV1().Events("")})

There's a TODO from 2016 about EventSinkImpl:

https://github.com/kubernetes/kubernetes/blame/49ddc510bc6b57fa430f779372b71144d328995c/staging/src/k8s.io/client-go/kubernetes/typed/core/v1/event_expansion.go#L154

Member:

Besides, adding a context to recordToSink then implies changing the StartEventWatcher API - let's not.

Why? The context is internal and closed by Shutdown(); I didn't mean to expose it.

that say that pending events will still get delivered after a shutdown

Events are ephemeral, but we can give some grace period and try to drain, WDYT?

pohly (Contributor Author):

eventBroadcasterImpl doesn't have a context that it could pass to recordToSink. I guess it would be possible to create one just for this purpose. Let me try this and then we can ask others for their opinion whether it is a breaking change.
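
Roughly the shape I have in mind, as a simplified sketch (the type and function names are illustrative, not the actual client-go code):

import (
    "context"
    "time"
)

type broadcaster struct {
    ctx    context.Context
    cancel context.CancelFunc
}

func newBroadcaster() *broadcaster {
    ctx, cancel := context.WithCancel(context.Background())
    return &broadcaster{ctx: ctx, cancel: cancel}
}

// Shutdown cancels the internal context so that retry loops like recordToSink
// can give up instead of sleeping until they run out of tries.
func (b *broadcaster) Shutdown() {
    b.cancel()
}

// retrySend retries send until it succeeds, the retry budget is exhausted,
// or the broadcaster gets shut down.
func (b *broadcaster) retrySend(send func() bool, maxTries int, sleep time.Duration) {
    for tries := 0; tries < maxTries; tries++ {
        if send() {
            return
        }
        select {
        case <-b.ctx.Done():
            return // shut down: stop retrying and drop the event
        case <-time.After(sleep):
        }
    }
}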

pohly (Contributor Author):

Pushed an update. What I meant by "change the Watch API" is that the result of StartRecordingToSink (a watch.Interface) cannot be used to shut down the watch and the delivery callback. That is harder to change.

@pohly (Contributor Author) commented Feb 3, 2023:

This seems to have worked. I've turned the change into a separate PR here: #115514

@pohly (Contributor Author) commented Feb 2, 2023

/retest

@k8s-ci-robot k8s-ci-robot removed the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Feb 7, 2023
@pohly pohly changed the title WIP: test/integration: goroutine leak check test/integration: goroutine leak check Feb 12, 2023
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 12, 2023
@k8s-ci-robot k8s-ci-robot added area/apiserver area/cloudprovider sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. labels Feb 12, 2023
@pohly (Contributor Author) commented Feb 12, 2023

I rebased to get rid of the already merged EventBroadcaster.Shutdown commit.

I also updated to goleak master because that removes the problematic golang.org/x/lint dependency. I've asked for a proper release. Do we need to wait for that?

I'd prefer to get this PR merged soonish because it blocks further work on scheduler_perf.

@pohly pohly force-pushed the goroutine-leak-check branch 3 times, most recently from 38baf1e to a939e71 on February 14, 2023 09:28
@pohly (Contributor Author) commented Feb 14, 2023

goleak v1.2.1 was released and this PR now uses it.

@liggitt: is this okay now as dependency?

If this ever causes problems during a Go update, it'll be trivial to comment out and restore later. It's only useful in tests, so there will be no production code that could possibly depend on it.

It provides more readable output and has additional APIs for using it inside a
unit test. goleak.IgnoreCurrent is needed to filter out the goroutine that gets
started when importing go.opencensus.io/stats/view.

In order to handle background goroutines that get created on demand and cannot
be stopped (like the one for LogHealthz), a helper function ensures that those
are running before calling goleak.IgnoreCurrent. Keeping those goroutines
running is not a problem and thus not worth the effort of adding new APIs to
stop them.

Other goroutines are genuine leaks for which no fix is available. Those get
suppressed via IgnoreTopFunction, which works as long as that function
is unique enough.

Example output for the leak fixed in kubernetes#115423:

    E0202 09:30:51.641841   74789 etcd.go:205] "EtcdMain goroutine check" err=<
        found unexpected goroutines:
        [Goroutine 4889 in state chan receive, with k8s.io/apimachinery/pkg/watch.(*Broadcaster).loop on top of the stack:
        goroutine 4889 [chan receive]:
        k8s.io/apimachinery/pkg/watch.(*Broadcaster).loop(0xc0076183c0)
        	/nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/watch/mux.go:268 +0x65
        created by k8s.io/apimachinery/pkg/watch.NewBroadcaster
        	/nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/watch/mux.go:77 +0x116
    >
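
Put together, the approach looks roughly like this in a TestMain-style entry point (a simplified sketch; the helper name runTests and the exact ignore list are illustrative, not the literal code in this PR):

import (
    "testing"

    "go.uber.org/goleak"
    "k8s.io/apiserver/pkg/server/healthz"
    "k8s.io/klog/v2"
)

func runTests(m *testing.M) int {
    // Start on-demand background goroutines (such as the healthz log checker)
    // now, so that IgnoreCurrent records them as pre-existing.
    _ = healthz.LogHealthz.Check(nil)

    opts := []goleak.Option{
        // Everything that is running at this point is acceptable.
        goleak.IgnoreCurrent(),
        // Genuine leaks without a fix, matched by the function on top
        // of the leaked goroutine's stack.
        goleak.IgnoreTopFunction("time.Sleep"),
    }

    code := m.Run()

    if err := goleak.Find(opts...); err != nil {
        klog.ErrorS(err, "goroutine leak check")
        code = 1
    }
    return code
}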
@liggitt (Member) commented Feb 14, 2023

yeah, this is more reasonable as a dependency now, thanks for helping clean it up

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 14, 2023
@k8s-ci-robot (Contributor):

LGTM label has been added.

Git tree hash: e293518a443aa90339a0a5551420998962b610a8

@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: liggitt, pohly

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 14, 2023
@pohly (Contributor Author) commented Feb 14, 2023

/retest

@k8s-ci-robot k8s-ci-robot merged commit 4cf352c into kubernetes:master Feb 14, 2023
@k8s-ci-robot k8s-ci-robot added this to the v1.27 milestone Feb 14, 2023
@cici37 (Contributor) commented Feb 14, 2023

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 14, 2023